Abstract

The ongoing functional annotation of proteins relies upon the work of curators to capture
experimental findings from scientific literature and apply them to protein sequence and
structure data. However, with the increasing use of high-throughput experimental assays,
a small number of experimental studies dominate the functional protein annotations
collected in databases. Here we investigate just how prevalent the "few articles - many
proteins" phenomenon is. We examine the experimentally validated annotation of proteins
provided by several groups in the GO Consortium, and show that the distribution of
proteins per published study is exponential, with 0.14% of articles providing the source of
annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the
dominant articles describes the use of an assay that can find only one function or a small
group of functions, this leads to substantial biases in what we know about the function
of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate
high-throughput experiments. Consequently, the functional information derived from these
experiments is mostly of the subcellular location of proteins, and of the participation
of proteins in embryonic developmental pathways. For some organisms, the information
provided by different studies overlaps by a large amount. We also show that the information
provided by high-throughput experiments is less specific than that provided by
low-throughput experiments. Given the experimental techniques available, certain biases in
protein function annotation due to high-throughput experiments are unavoidable. Knowing
that these biases exist and understanding their characteristics and extent is important
for database curators, developers of function annotation programs, and anyone who uses
protein function annotation data to plan experiments.



Author Summary

Experiments and observations are the vehicles used by science to understand the world
around us. In the field of molecular biology, we are increasingly relying on high-throughput,
genome-wide experiments to provide answers about the function of biological
macromolecules. However, any experimental assay is essentially limited in the type of
information it can discover. Here we show that our increasing reliance on high-throughput
experiments biases our understanding of protein function. While the primary source of
information is experiments, the functions of many proteins are computationally annotated
by sequence-based similarity, either directly or indirectly, to proteins whose function is
experimentally determined. Therefore, any biases in experimental annotations can get
amplified and entrenched in the majority of protein databases. We show here that
high-throughput studies are biased towards certain aspects of protein function, and that they
provide less information than low-throughput studies. While there is no clear solution to
the phenomenon of bias from high-throughput experiments, recognizing its existence and
its impact can help us take steps to mitigate its effect.

Introduction

Functional annotation of proteins is an open problem and a primary challenge in molecular
biology today [IHl]. The ongoing improvements in sequencing technology have shifted
the emphasis from realizing the $1,000 genome to realizing the 1-hour genome [3].
The ability to rapidly and cheaply sequence genomes is creating a flood of sequence
data, but to make these data useful, extensive analysis is needed. A large portion of
this analysis involves assigning biological function to newly determined gene sequences,
a process that is both complex and costly [6]. To aid current annotation procedures and
improve computational function prediction algorithms, high-quality and experimentally
derived data are necessary. Currently, one of the few repositories of such data is the
UniProt-GOA database [7], which is a compilation of data contributed by several member
groups of the GO Consortium. UniProt-GOA contains functional information derived
from literature and by computational means. The information derived from literature is
extracted by human curators who capture functional data from publications, assign the
data to their appropriate place in the Gene Ontology hierarchy [8] and label them with
appropriate functional evidence codes. Because it pools the annotations of several member
groups, UniProt-GOA presents the current state of our view of protein function space. It is
therefore important to understand any trends and biases that are encapsulated in
UniProt-GOA, as those impact well-used sister databases and consequently a large number
of users worldwide.

One concern surrounding the capture of functional data from articles is the propensity
for high-throughput experimental work to become a large fraction of the data in the GO
Consortium database, so that a small number of experiments dominate the protein
function landscape. In this work we analyzed the relative contribution of the peer-reviewed
articles describing all the experimentally derived annotations in UniProt-GOA. We found
some striking trends, stemming from the fact that a small fraction of articles describing
high-throughput experiments disproportionately contribute to the pool of experimental
annotations of model organisms. Consequently we show that: 1) annotations coming
from high-throughput experiments are overall less informative than those provided by
low-throughput experiments; 2) annotations from high-throughput experiments are biased
towards a limited number of functions; and 3) many high-throughput experiments overlap
in the proteins they annotate, and in the annotations assigned. Taken together, our
findings offer a picture of how the protein function annotation landscape is generated
from scientific literature. Furthermore, due to the biases inherent in the current system
of sequence annotations, this study serves as a caution to the producers and consumers
of biological data from high-throughput experiments.

Results

Articles and Proteins

The increase in the number of high-throughput experiments used to determine protein
functions may introduce biases into experimental protein annotations, due to the inherent
capabilities and limitations of high-throughput assays. To test the hypothesis that
such biases exist, and to study their extent if they do, we compiled the details of all
experimentally annotated proteins in UniProt-GOA. This included all proteins whose GO
annotations have the GO experimental evidence codes EXP, IDA, IPI, IMP, IGI, IEP (see
Methods for an explanation of GO evidence codes). We first examined the distribution
of articles that are the source of experimentally validated annotations by the number of
proteins they annotate. As can be seen in Figure 1, the distribution of the number of
proteins annotated per article follows a power-law distribution, $f(x) = ax^{k}$. Using
linear regression over the log values of the axes we obtained a fit with
$p < 1.18 \times 10^{-[\ldots]}$ and $k = -0.72$. We therefore conclude that there is indeed
a substantial bias in experimental annotations, in which there are few articles that
annotate a large number of proteins.
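
The log-log regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the function name `fit_power_law` and the synthetic counts are hypothetical, and the fit is an ordinary least-squares line through the log-transformed points, whose slope is the power-law exponent.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares on log(y) = log(a) + k * log(x).

    Returns (a, k). All x and y values must be positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    k = sxy / sxx                       # slope on the log-log plot
    a = math.exp(mean_y - k * mean_x)   # intercept, mapped back from log space
    return a, k


# Synthetic data (NOT from the study): article counts generated as 1000 * x**-0.5
proteins_per_article = [1, 4, 16, 64, 256]
article_counts = [1000 * x ** -0.5 for x in proteins_per_article]
a, k = fit_power_law(proteins_per_article, article_counts)
```

A negative exponent means that the number of articles falls off as the number of proteins annotated per article grows.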

To better understand the consequences of such a distribution, we divided the annotating
articles into four cohorts, based on the number of proteins each article annotates.
Single-throughput articles are those articles that annotate only one protein; low-throughput
articles annotate 2-9 proteins; moderate-throughput articles annotate 10-99 proteins;
and high-throughput articles annotate over 99 proteins. The results are shown in Table 1.
The most striking finding is that high-throughput articles are responsible for 25% of the
annotations that the GO Consortium creates, even though they make up only 0.14% of
the articles. Single-throughput and low-throughput articles make up 96% of the articles,
yet they annotate only 53% of the proteins. So while moderate-throughput and high-throughput
studies account for almost 47% of the annotations in UniProt-GOA, they constitute only
3.66% of the studies published.
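
The cohort boundaries given above translate directly into code. The sketch below is illustrative only: the names `cohort` and `cohort_shares` and the article identifiers are hypothetical, and the input is assumed to be the number of distinct proteins annotated by each article.

```python
from collections import Counter


def cohort(n_proteins):
    """Assign an article to a throughput cohort by the number of proteins it annotates."""
    if n_proteins < 1:
        raise ValueError("an annotating article annotates at least one protein")
    if n_proteins == 1:
        return "single"
    if n_proteins <= 9:
        return "low"
    if n_proteins <= 99:
        return "moderate"
    return "high"


def cohort_shares(proteins_per_article):
    """Map {article_id: protein_count} to {cohort: fraction of articles}."""
    counts = Counter(cohort(n) for n in proteins_per_article.values())
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts}


# Hypothetical article identifiers and counts, for illustration only
example = {"article_A": 1, "article_B": 5, "article_C": 42, "article_D": 1500}
shares = cohort_shares(example)
```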

To understand how the log-odds distribution affects our understanding of protein function,
we examined different aspects of the annotations in the four article cohorts. We also
examined in greater detail the top-50 high-throughput annotating articles, that is, the
articles describing experimental annotations that are top ranked by the number of proteins
annotated per article. An initial characterization of these articles is shown in Table S1.
As can be seen, most of the articles are specific to a single species (typically a model
organism) and to a single assaying pipeline that is used to assign function to the proteins
in that organism. With one exception, only one of the three GO ontologies was used for
annotation in any single experiment. The three ontologies are Molecular Function (MF),
Biological Process (BP) and Cellular Component (CC). These are separate ontologies within
GO, describing different aspects of function as detailed in [8]. As we show later, for some
species this means that a single functional aspect (MF, BP or CC) of a species can be
dominated by a single study.

The Impact of High-Throughput Studies on the Annotation of Model Organisms

We examined the relative contribution of the top-50 articles to the entire corpus of
experimentally annotated proteins in each species. Unsurprisingly, all the species found in
the top-50 articles were either common model organisms or human. For each species,
we examined the five most frequent terms in the top-50 articles. We then examined
the contribution of each of these terms by the top-50 articles to the general annotations
of that species. The contribution is the number of annotations with any given GO term in
the top-50 articles divided by the number of annotations with that GO term in all of
UniProt-GOA. For example, as seen in Figure 2, in D. melanogaster 88% of the annotations
using the term "precatalytic spliceosome" in articles experimentally annotating this species
are contributed by the top-50 articles.
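
As a minimal sketch of the contribution ratio defined above, assume the annotations of one species are held as (article, protein, GO term) tuples. The function name, the tuple layout and the toy identifiers are hypothetical and serve only to make the ratio concrete.

```python
def term_contribution(annotations, top_articles, go_term):
    """Fraction of the annotations using go_term that come from top_articles.

    annotations: list of (article_id, protein_id, go_term) tuples for one species.
    top_articles: set of article ids (for example, the top-50 articles).
    """
    with_term = [article for article, _protein, term in annotations if term == go_term]
    if not with_term:
        return 0.0
    from_top = sum(1 for article in with_term if article in top_articles)
    return from_top / len(with_term)


# Toy data: three of the four annotations with term "T1" come from the top article
toy_annotations = [
    ("top_1", "P1", "T1"),
    ("top_1", "P2", "T1"),
    ("top_1", "P3", "T1"),
    ("small_1", "P4", "T1"),
    ("small_1", "P4", "T2"),
]
contribution = term_contribution(toy_annotations, {"top_1"}, "T1")
```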

For most organisms annotated by the top-50 articles, the annotations were within the
cellular component or biological process ontologies. Notable exceptions are D. melanogaster
and C. elegans where the dominant terms were from the Biological Process ontology, and
mouse, where "protein binding" and "identical protein binding" are from the Molecular
Function ontology. D. melanogaster's annotation for the top terms is dominated (over
50% contribution) by the top-50 articles.

The term frequency bias described here can be viewed more broadly within the ontology
bias. The annotations in the single-throughput, low-throughput and moderate-throughput
cohorts are distributed similarly across the three ontologies: 22-26% of assigned terms
are in the Molecular Function ontology, 51-57% are in the Biological Process ontology and
the remaining 17-25% are in the Cellular Component ontology. These ratios change
dramatically with high-throughput articles (over 99 proteins per article). In the
high-throughput articles, only 5% of assigned terms are in the Molecular Function ontology,
38% in the Biological Process ontology and 57% in the Cellular Component ontology,
ostensibly due to a lack of high-throughput assays that can be used for generating
annotations using the Molecular Function ontology.
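
The ontology shares quoted above are simple proportions over the assigned terms of a cohort. A minimal sketch follows, assuming each annotation carries the one-letter aspect code used in GO annotation files (F for Molecular Function, P for Biological Process, C for Cellular Component); the function name `aspect_shares` is hypothetical.

```python
from collections import Counter


def aspect_shares(aspects):
    """Return the fraction of annotations in each GO aspect (F, P, C)."""
    counts = Counter(aspects)
    total = sum(counts[code] for code in ("F", "P", "C"))
    if total == 0:
        raise ValueError("no annotations with a recognised aspect code")
    return {code: counts[code] / total for code in ("F", "P", "C")}


# Toy list reproducing the high-throughput proportions quoted in the text
toy_aspects = ["F"] * 5 + ["P"] * 38 + ["C"] * 57
high_throughput_shares = aspect_shares(toy_aspects)
```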

Repetition and Consistency in Top-50 Annotations

How many of the top-50 articles actually annotate the same set of proteins? Answering
this question tells us how repetitive experiments are in identifying the same set of
proteins to annotate. However, even when annotating the same set of proteins within
the same ontology, different experiments may provide different results, and so lack
consistency. Therefore, annotation consistency was also checked. Repetition is given as
$n/N$, with $n$ being the number of proteins annotated by two or more articles, and $N$
being the total number of proteins. The results of the repetition analysis are shown in
Figure 3 and in Table 2. As can be seen, the highest repetition (65%) is in the 12 articles
annotating C. elegans. Of course, a higher number of articles is expected to increase
repetitive annotations simply due to increased sampling of the genome. However, the goal
of this analysis is to present the degree of repetition, rather than to try to rank and
normalize it. As an additional repetition metric, Table 2 also lists the mean number of
sequences per cluster. When normalized by the number of annotating articles, the highest
repetition is found in mouse (15.33% in three articles), closely followed by M. tuberculosis
(14% in two articles). Taken together, these results show that there is repetition in
choosing the proteins that are to be annotated in most model organisms using
high-throughput assays, although the rate of this repetition varies widely.
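
A minimal sketch of the repetition measure, assuming it is the ratio of the number of proteins annotated by two or more articles to the total number of proteins annotated by the article set. The function name and the toy identifiers are hypothetical.

```python
from collections import Counter


def repetition(article_proteins):
    """Fraction of proteins that are annotated by two or more articles.

    article_proteins: {article_id: set of protein ids annotated by that article}.
    """
    times_annotated = Counter(
        protein
        for proteins in article_proteins.values()
        for protein in set(proteins)
    )
    if not times_annotated:
        return 0.0
    repeated = sum(1 for count in times_annotated.values() if count >= 2)
    return repeated / len(times_annotated)


# Toy data: only P2 is annotated by more than one article, out of four proteins
toy_articles = {
    "article_A": {"P1", "P2"},
    "article_B": {"P2", "P3"},
    "article_C": {"P2", "P4"},
}
toy_repetition = repetition(toy_articles)
```

Dividing the result by the number of articles gives a per-article normalization of the kind reported for mouse and M. tuberculosis.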

Consistency analysis took place as described in Methods. The consistency measure
is normalized on a 0-1 scale, with 1 being most consistent, meaning that all annotations
from all sources are identical. Table 3 shows the results of this analysis. In A. thaliana,
1941 proteins are annotated by 15 articles and 18 terms in the Cellular Component
ontology. The mean maximum-consistency is 0.251. The highest mean consistency is for
the annotation of 807 mouse proteins in the Cellular Component ontology, with
an annotation consistency of 0.832. However, that is not surprising given that there are
only three annotating articles and two annotating terms. We omitted the ontology and
organism combinations that were annotated by fewer than three articles or two GO terms,
or both.

Quantifying Annotation Information

A common assumption holds that while high-throughput experiments do annotate more
protein functions than low-throughput experiments, the former also tend to be more
shallow in the predictions they provide. The information provided, for example, by a
large-scale protein binding assay will only tell us whether two proteins bind, but will
not reveal whether that binding is specific, will not provide an exact $K_{\mathrm{bind}}$,
and will not say under what conditions binding takes place, or whether there is any
enzymatic reaction or signal transduction involved. Having on hand data from experiments
with different "throughputness" levels, we set out to investigate whether there is indeed
a difference in the information provided by high-throughput experiments vs. low-throughput
ones. We examined the information provided by GO terms in each paper cohort using two
methods: edge count and information content. See Methods for details.
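
The two measures can be illustrated with a short sketch. Both functions rest on assumptions rather than on the authors' scripts: `edge_count` takes the shortest path of parent links from a term to the ontology root, and `information_content` uses the negative log2 of a term's relative frequency in a list of annotations, which is one common definition of a log-frequency score. The three-term ontology fragment is a toy example.

```python
import math
from collections import Counter


def edge_count(term, parents, root):
    """Shortest number of parent edges from term up to root (breadth-first search)."""
    frontier, seen, depth = {term}, {term}, 0
    while frontier:
        if root in frontier:
            return depth
        frontier = {p for t in frontier for p in parents.get(t, ()) if p not in seen}
        seen |= frontier
        depth += 1
    raise ValueError(f"{term!r} is not connected to {root!r}")


def information_content(term, annotations):
    """Negative log2 of the relative frequency of term among the annotations."""
    count = Counter(annotations)[term]
    if count == 0:
        raise ValueError(f"{term!r} does not occur in the annotations")
    return -math.log2(count / len(annotations))


# Toy ontology fragment: child term -> list of parent terms
toy_parents = {
    "catalytic activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
}
depth_binding = edge_count("protein binding", toy_parents, "molecular_function")
ic_common = information_content("protein binding", ["protein binding"] * 2 + ["binding"] * 2)
```

Under both measures a lower score marks a less specific term: terms close to the root have small edge counts, and terms used very often carry little information content.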

The results of both analyses are shown in Figure 4. In general, the results from the
edge-count analysis and the information-content analysis are in agreement when
compared across annotation cohorts. For the Molecular Function ontology, the distribution
of edge counts and log-frequency scores decreases as the number of annotated proteins
per article increases. For the Biological Process ontology, the decrease is significant.
However, the contributors to the decrease are the high-throughput articles, while there is
little change in the first three article cohorts. Finally, there is no significant trend of
GO-depth decrease in the Cellular Component ontology. However, using the
information-content metric, there is also a significant decrease in information content in
the high-throughput article cohort.

Exclusive High-Throughput Annotations

Of interest is the fraction of proteins that are exclusively annotated by high-throughput
experiments. The question here is: of the experimentally annotated proteins in an
organism, for how many do we know the function only from high-throughput experiments?
We have seen that high-throughput experiments annotate a large number of proteins, but
still some 80% of experimentally annotated proteins are annotated via medium-, low- and
single-throughput experiments. Given the lower information content of high-throughput
experiments, it is important to know which organisms have a substantial fraction of their
proteins experimentally annotated by high-throughput studies only. To do so, we analyzed
all species with more than 200 genes in the NCBI taxa database for the fraction
of the genes that are exclusively annotated by high-throughput studies. The results are
shown in Table 4.
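
The per-species fraction described above can be sketched as a set difference. This is an illustration under assumptions, not the study's code: the function name is hypothetical, the input is taken to be the articles annotating a single species, and the default threshold follows the cohort definition of more than 99 proteins per article.

```python
def exclusive_high_throughput_fraction(article_proteins, threshold=99):
    """Fraction of annotated proteins that only high-throughput articles annotate.

    article_proteins: {article_id: set of protein ids} for one species.
    An article counts as high-throughput if it annotates more than `threshold` proteins.
    """
    by_high, by_other = set(), set()
    for proteins in article_proteins.values():
        target = by_high if len(proteins) > threshold else by_other
        target.update(proteins)
    annotated = by_high | by_other
    if not annotated:
        return 0.0
    return len(by_high - by_other) / len(annotated)


# Toy data with a lowered threshold so the example stays small
toy_species = {"big_screen": {"P1", "P2", "P3"}, "small_study": {"P3", "P4"}}
toy_fraction = exclusive_high_throughput_fraction(toy_species, threshold=2)
```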

As can be seen, although the fraction of high-throughput annotated proteins is large,
not many species are affected with a large fraction of proteins that are exclusively
annotated by high-throughput studies. However, the few species that are affected are
important study and model species. It is important to note that some redundancy due to
isoforms, mutants and duplications may exist.

Frequently Used High-Throughput Experiments

The twenty GO evidence codes, discussed above, encapsulate the means by which the
function was inferred, but they do not capture all the necessary information. For example,
"Inferred from Direct Assay" (IDA) tells us that some experimental assay was used, but
does not say which type of assay. This information is often needed, since knowing which
experiments were performed can help the researcher establish the reliability and scope
of the produced data. For example, the RNA used in an RNAi experiment does not cross the
blood-brain barrier, meaning that no data from the central nervous system can be drawn
from an RNAi experiment. The Evidence Code Ontology, or ECO, seeks to improve upon the
GO-attached evidence codes. ECO provides more elaborate terms than "Inferred from Direct
Assay": it also conveys which assay was used, for example "microscopy" or "RNA
interference". In addition to evidence terms, ECO provides assertion terms
in which the nature of the assay is given. For example, an enzyme-linked immunosorbent
assay (ELISA) provides quantitative protein data in vitro, while an immunogold assay may
provide the same information as well as cellular localization information in situ. We
manually assigned ECO assertion and evidence terms to the top-50 articles. The assignment
is shown in detail in Table S2. Table S3 shows the sorted count of ECO terms in the top-50
papers.

The most frequent ECO term used is ECO:0000160 "protein separation followed by
fragment identification evidence": this fits the 27 papers that essentially describe
mass-spectrometry studies. Consequently, the assignment procedure is limited
to the cellular compartments that can be identified with the fractionation methods used.
So while Cellular Component is the most frequent annotation used, fractionation followed
by mass-spectrometry is the most common method used to localize proteins in subcellular
compartments. A notable exception to the use of fractionation and MS for protein
localization is the top annotating article [9], which uses microscopy for subcellular
localization.

The second most frequent experimental ECO term is "Imaging assay evidence" (ECO:000044).
Several types of studies fall under this term: microscopy, RNAi, some of the
mass-spectrometry studies that used microscopy, and a yeast two-hybrid study. As imaging
information is used in a variety of studies, this ECO term is not informative of the chief
method used in any study, but rather reflects the importance of imaging assays in a variety
of methods. The third most frequent experimental ECO term used was "Cell fractionation
evidence", which is closely associated with the top term, "protein separation followed by
fragment identification evidence". The fourth and fifth most frequent ECO terms used were
"loss-of-function mutant phenotype evidence" (ECO:0000016) and "RNAi evidence"
(ECO:000019). These two terms are also closely associated: both describe RNAi whole-genome
gene knockdowns in C. elegans and D. melanogaster, and one in C. albicans. RNAi experiments
use targeted dsRNA, which is delivered to the organism and silences specific genes.
Typically the experiments here used libraries of RNAi targeted to the whole exome (for
example [10-13]). The phenotypes searched for were mostly associated with embryonic and
post-embryonic development. Some studies focused on mitotic spindle assembly [14], lipid
storage [15] and endocytic traffic [16]. One study used RNAi to identify mitochondrial
protein localization [17]. These studies mostly use the same RNAi libraries, and target the
whole C. elegans genome using common data resources, hence the large redundancy observed
for C. elegans in Table 2. It should be noted that all experiments are associated with
computational ECO terms, which describe sequence similarity and motif recognition
techniques used to identify the sequences found: "sequence similarity evidence",
"transmembrane domain prediction evidence", "protein BLAST evidence" etc. These terms are
all bolded in Table S3. A strong reliance on computational annotation is therefore an
integral part of high-throughput experiments. Computational annotation here is not used
directly for functional annotation, but rather for identifying the protein by a sequence or
motif similarity search. The third most frequently used assertion in the top experimental
articles was not an experimental assertion, but rather a computational one: the term
ECO:00053 "computational combinatorial evidence" is defined as "A type of combinatorial
analysis where data are combined and evaluated by an algorithm." This is not a
computational prediction per se, but rather a combination of several experimental lines of
evidence used in an article.

Discussion

We have identified several annotation biases in GO annotations provided by the GO
Consortium. These biases stem from the uneven number of annotations produced by
different types of experiments. It is clear that results from high-throughput experiments
contribute substantially to the function annotation landscape, as up to 20% of
experimentally annotated proteins are annotated by high-throughput assays. At the same time,
high-throughput experiments produce less information per protein than moderate-, low-
and single-throughput experiments, as evidenced by the type of GO terms produced in
the Molecular Function and Biological Process ontologies. Furthermore, the number of
total GO terms used in the high-throughput experiments is much lower than that used in
low- and medium-throughput experiments. Therefore, while high-throughput experiments
provide a high coverage of protein function space, it is the low-throughput experiments
that provide more specific information, as well as a larger diversity of terms.

We have also identified several types of biases that are contributed by high-throughput
experiments. First, there is the enrichment of low-information-content GO terms, which
means that our understanding of protein function as provided by high-throughput
experiments is more limited than that provided by low-throughput experiments. Second,
there is the small number of terms used, when considering the large number of proteins
that are being annotated. Third is the general ontology bias towards the Cellular
Component ontology and, to a lesser extent, the Biological Process ontology: there are very
few articles that deal with the Molecular Function ontology. These biases all stem from
the inherent capabilities and limitations of the high-throughput experiments. A fourth,
related bias is the organism studied: taken together, studies of C. elegans and A. thaliana
comprise 36 of the top-50 annotating articles, or 72%.

Information Capture and Scope of GO

We have discussed the information loss that is characteristic of high-throughput
experiments, as shown in Figure 4. However, another reason for information loss is the
inability to capture certain types of information using the Gene Ontology. GO is
purposefully limited to three aspects (MF, BP and CC) of biological function, which are
assigned per protein. However, other aspects of function may emerge from experiments. Of
note is the study "Proteome survey reveals modularity of the yeast cell machinery" [9]. In
this study, the information produced was primarily of protein complexes, and their
relationship to cellular compartmentalization and biological networks. At the same time,
the only GO term captured in the curation of proteins from this study was "protein
binding". Some, but not all, of this information can be captured more specifically using
the children of the term "protein binding", but manually curating that information from a
high-throughput article is arguably laborious. Furthermore, the main information
conveyed by this article, namely the types of protein complexes discovered and how they
relate to cellular networks, is outside the scope of GO. It is important to realize that while
high-throughput experiments do convey less information per protein within the functional
scope as defined by GO, they still convey composite information such as possible pathway
mappings - information which needs to be captured into annotation databases by means
other than GO. In the example above, the information can be captured by a protein
interaction database, but not by GO terms. Methods such as the Statistical Tracking of
Ontological Phrases [18] can help in selecting the appropriate ontology for better
information capture.

Conclusions

Taken together, the annotation trends in high-throughput studies affect our understanding
of protein function space. This, in turn, affects our ability to properly understand the
connection between predictors of protein function and the actual function - the hallmark
of computational function annotation. As a dramatic example, during the 2011 Critical
Assessment of Function Annotation experiment [19] it was noticed that roughly 20% of
the proteins participating in the challenge and annotated with the Molecular Function
ontology were annotated as "protein binding", a GO term that conveys little information.
Furthermore, it was shown that the major contribution of the "protein binding" term
to the CAFA challenge data set was due to high-throughput assays. This illustrates how
the concentration of a large number of annotations in a small number of studies provides
only a partial picture of the function of these proteins. As we have seen, the picture
provided by high-throughput experiments is mainly of: 1) subcellular localization, based on
cell fractionation and MS, and 2) developmental phenotypes. While these
data are important, we should be mindful of this bias when examining protein function in
the database, even for annotations deemed to be of high quality, namely those with
experimental verification. Furthermore, such a large bias in prior probabilities can
adversely affect programs employing prior probabilities, as most machine-learning programs
do. If the training set for these programs includes a disproportionate number of annotations
from high-throughput experiments, the results these programs provide will be strongly biased
towards a few frequent and shallow GO terms.

To remedy the bias created by high-throughput annotations, the provenance of
annotations should be described in more detail by curators and curation software. Many
function annotation algorithms rely on homology transfer as part of their pipeline to
annotate query sequences [1, 19]. Knowing the annotation provenance, including the number
of proteins annotated by the original paper, makes it possible to create less biased
benchmarks or otherwise incorporate that information into the annotation procedure. ECO
can be used to determine the source of the annotation, and the user or the algorithm can
decide whether to rely upon any combination of "throughputness" and experimental type. Of
course, such approaches should be taken cautiously, as sweeping measures can cause the
unintended loss of information. We hereby call upon the communities of annotators,
computational biologists and experimental biologists to be mindful of the experimental
biases described in this study, and to work to understand their implications
and impact.



Methods

We used the UniProt-GOA database from December 2011. Data analyses were performed
using Python scripts. The following tools were used in the analyses: Biopython [20] and
matplotlib [21]. ECO terms classifying the proteins in the top 50 experiments were assigned
to the proteins manually after reading the articles. All data and scripts are available on:
http://github.com/idoerg/Uniprot-Bias/ and on http://datadryad.org (the latter upon
acceptance).



Use of GO evidence codes

355 Proteins in UniProt-GOA are annotated with one or more GO terms using a procedure 

356 described in Dimmer et al. (2012). Briefly, this procedure consists of six steps which 

357 include sequence curation, sequence motif analyses, literature-based curation, reciprocal 

358 BLAST [22] searches, attribution of all resources leading to the included findings, and 

359 quality assurance. If the annotation source is a research article, the attribution includes 

360 its PubMed ID. For each GO term associated with a protein, there is also an evidence code 

361 which the curator assigns to explain how the association between the protein and the GO 

362 term was made. Experimental evidence codes include such terms as: Inferred by Direct As- 

363 say (IDA) which indicates that "a direct assay was carried out to determine the function, 

364 process, or component indicated by the GO term" or Inferred from Physical Interaction 

365 (IPI) which "Covers physical interactions between the gene product of interest and another 
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366 molecule." (All GO evidence code definitions were taken from the GO site, geneontol- 

367 ogy.org.) Computational evidence codes include terms such as Inferred from Sequence or 

368 Structural Similarity (ISS) and Inferred from Sequence Orthology (ISO). Although the ev- 

369 idence in computational evidence codes is non-experimental, the proteins annotated with 

370 these evidence codes are still assigned by a curator, rendering a degree of human oversight. 

371 Finally, there are also computational, non-experimental evidence codes, the most preva- 

372 lent being Inferred from Electronic Annotation (lEA) which is "used for annotations that 

373 depend directly on computation or automated transfer of annotations from a database" . 

374 lEA evidence means that the annotation is electronic, and was not made or checked by a 

375 person. Different degrees of reliability are associated with different evidence codes, with 

376 experimental codes generally considered to be of higher reliability than non-experimental 



377 codes. (For details see: http://www.ebi.ac.uk/GOA/ElectronicAnnotationMethods) 
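As an illustration of how evidence codes and PubMed IDs can be read from an annotation file, the following is a minimal sketch and not the scripts used in this study. It assumes the tab-separated GAF layout (protein ID in column 2, GO ID in column 5, reference in column 6, evidence code in column 7) and treats EXP, IDA, IPI, IMP, IGI and IEP as the experimental codes. The function name and the sample rows, including the gene symbol, are hypothetical.

```python
import io

# GO evidence codes treated here as experimental (an assumption of this sketch).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_annotations(gaf_handle):
    """Yield (protein, go_id, pmid, evidence) for experimentally supported rows."""
    for line in gaf_handle:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete annotation row
        protein, go_id, refs, evidence = cols[1], cols[4], cols[5], cols[6]
        if evidence not in EXPERIMENTAL_CODES:
            continue
        # The reference column may hold several references separated by "|".
        for ref in refs.split("|"):
            if ref.startswith("PMID:"):
                yield protein, go_id, ref[len("PMID:"):], evidence


# Hypothetical sample: one experimental (IDA) row and one electronic (IEA) row.
SAMPLE = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tP36023\tOAF3\t\tGO:0005739\tPMID:16823961\tIDA\t\tC\n"
    "UniProtKB\tP36023\tOAF3\t\tGO:0005634\tGO_REF:0000002\tIEA\t\tC\n"
)
rows = list(experimental_annotations(io.StringIO(SAMPLE)))
```

Only the IDA row is kept; grouping the resulting tuples by PubMed ID would give the number of proteins annotated per article.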



Quantifying GO-term Information

We used two methods to quantify the information given by GO terms. First, we used edge counting, where the information contained in a term depends on the edge distance of that term from the ontology root. The term "catalytic activity" (one edge from the ontology root node) is less informative than "hydrolase activity" (two edges), which in turn is less informative than "haloalkane dehalogenase activity" (five edges). We therefore counted edges from the ontology root term to the GO term to determine term information. The larger the number of edges, the more specific (and therefore informative) the annotation. In cases where several paths lead from the root to the examined GO term, we used the minimal path. We did so for all the annotating articles, split into groups by the number of proteins each article annotates.

While edge counting provides a measure of term specificity, this measure is imperfect. The reason is that each of the three GO ontologies is constructed as a directed acyclic graph (DAG) in which different areas have different connectivities, and terms may have different depths unrelated to the intuitive specificity of a term. For example, "D-glucose transmembrane transporter activity" (GO:0055056) is 10 terms deep, while "L-tryptophan transmembrane transporter activity" (GO:0015196) is 14 terms deep. It is hard to discern whether these differences are meaningful. For this reason, information content, the logarithm of the inverse of the GO term frequency in the corpus, is generally accepted as a measure of the information conveyed by a GO term [23, 24]. To account for the possible bias created by the GO DAG structure, we also used the log-frequency of the terms in the experimentally annotated proteins in UniProt-GOA. However, it should be noted that the log-frequency measure is also imperfect because, as we see throughout this study, a GO term's frequency may be heavily influenced by the top annotating articles, which introduces circularity into the use of this metric. Since no single metric for measuring the information conveyed by a GO term is wholly satisfactory, we used both edge counting and information content in this study.
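The two measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the ontology is a hypothetical toy DAG stored as a child-to-parents dictionary, the logarithm base (2) is an arbitrary choice, and term frequency is taken from raw annotation counts without propagating counts to ancestor terms.

```python
import math
from collections import deque

# Hypothetical toy ontology: each term maps to its parents; "root" has none.
PARENTS = {
    "root": [],
    "catalytic activity": ["root"],
    "hydrolase activity": ["catalytic activity"],
    "binding": ["root"],
    "toy leaf": ["hydrolase activity", "binding"],  # two paths to the root
}


def min_depth(term, parents):
    """Minimal number of edges from `term` up to the root (breadth-first search)."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        node, dist = queue.popleft()
        if not parents[node]:
            return dist  # reached the root; BFS guarantees this is minimal
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    raise ValueError("term has no path to the root")


def information_content(term, term_counts):
    """Logarithm of the inverse relative frequency of `term` (base 2 assumed)."""
    total = sum(term_counts.values())
    return math.log2(total / term_counts[term])


depth_leaf = min_depth("toy leaf", PARENTS)          # shortest path goes via "binding"
ic_rare = information_content("rare", {"rare": 1, "common": 3})
```

A term reachable by several paths is scored by its shortest one, and a term seen once among four annotations receives more information content than one seen three times.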

Annotation Consistency

To examine annotation consistency, we employed the following method: given a protein $P$, let $G$ be the terminal (leaf) GO terms $g_1, g_2, \ldots, g_m$ that annotate that protein in all top-50 articles for a single ontology $O \in \{\mathrm{BPO}, \mathrm{MFO}, \mathrm{CCO}\}$. The count of each of these GO terms per protein per ontology is $n_1, n_2, \ldots, n_m$, with $n_i$ being the number of times GO term $g_i$ annotates protein $P$.

The total number of annotations for a protein in an ontology is $\sum_{i=1}^{m} n_i$. The maximum annotation consistency for protein $P$ in ontology $O$, $0 \le k_{P,O} \le 1$, is calculated as:

$$k_{P,O} = \frac{\max\{n_1, n_2, \ldots, n_m\}}{\sum_{i=1}^{m} n_i} \quad \text{for } \max\{n_1, n_2, \ldots, n_m\} \ge 2$$

For example, the protein "Oleate activated transcription factor 3" (UniProt ID: P36023) in S. cerevisiae is annotated four times by three articles using the Cellular Component ontology:

PubMed ID   UniProt ID   Ontology   GO term      Description
14562095    P36023       CCO        GO:0005634   nucleus
14562095    P36023       CCO        GO:0005737   cytoplasm
16823961    P36023       CCO        GO:0005739   mitochondrion
14576278    P36023       CCO        GO:0005739   mitochondrion

The annotation consistency for P36023 is therefore the maximum count of identical GO terms (mitochondrion, 2), divided by the total number of annotations (4): 2/4 = 0.5.
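The calculation for P36023 can be reproduced with a short sketch. The function name is hypothetical, and returning `None` when no term occurs at least twice is an assumption of this sketch: the measure is defined in the text only for a maximum count of two or more.

```python
from collections import Counter


def annotation_consistency(go_terms):
    """Return max(n_i) / sum(n_i) for one protein in one ontology.

    `go_terms` lists the leaf GO term of every annotation, repeats included.
    Returns None when no term occurs at least twice (case not defined here).
    """
    counts = Counter(go_terms)
    if not counts:
        return None
    top = max(counts.values())
    if top < 2:
        return None
    return top / sum(counts.values())


# The four CCO annotations of P36023 listed in the example.
p36023_cco = ["GO:0005634", "GO:0005737", "GO:0005739", "GO:0005739"]
k = annotation_consistency(p36023_cco)  # 2 identical terms out of 4 annotations
```

Because only identical leaf terms are counted, the sketch needs no access to the GO graph itself.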

When choosing a measure for annotation consistency, we favored a simple and interpretable measure. We therefore examined identity among leaf terms only, rather than use a more complex comparison of multiple subgraphs in the GO ontology DAG (Directed Acyclic Graph). Such a comparison, done without manual curation, is unreliable and may skew the perception of similarity [25].
Figure 1. Distribution of the number of proteins annotated per article.

X-axis: number of annotating articles. Y-axis: number of annotated proteins. The distribution was found to be logarithmic with a significant ($R^2 = 0.72$; $p < 1.10 \times 10^{\ldots}$) linear fit to the log-log plot. The data came from 76137 articles annotating 256033 proteins with GO experimental evidence codes, in UniProt-GOA 12/2011.

Figure 2. Relative contribution of top-50 articles to the annotation of major model organisms. The length of each bar represents the percentage of proteins annotated by the top-50 articles in a given organism by a given GO term. GO terms that are present in more than one species are highlighted.

Figure 3. Redundancy in proteins described by the top-50 articles. A circle represents the sum total of articles annotating each organism. Each colored arc is composed of all the proteins in a single article. A line is drawn between any two points on the circle if the proteins they represent have 100% sequence identity. A black line is drawn if they are annotated with a different ontology (for example, in one article the protein is annotated with the MFO, and in another article with BPO); a red line if they are annotated in the same ontology. Example: S. pombe is described by two articles, one with few proteins (light arc on bottom) and one with many (dark arc encompassing most of the circle). Many of the same proteins are annotated by both articles. See Table 2 for numbers.

Figure 4. Information provided by articles depending on the number of proteins the articles annotate. Articles are grouped into cohorts: 1: one protein annotated by article; < 10: more than 1, up to 10 annotated; < 100: more than 10, less than 100 annotated; > 100: 100 or more proteins annotated per article. Blue bars: Molecular Function ontology; Green bars: Biological Process ontology; Red bars: Cellular Component ontology. Information is gauged by A: Information Content and B: GO depth. See text for details.
Tables

Table 1. Annotation Cohorts

Articles annotating the following number of proteins   1       1 < n < 10   10 < n < 100   n ≥ 100   SUM
Number of proteins annotated                           20699   46383        26485          31411     124978
Number of annotating articles                          41156   32201        2672           108       76137
Percent of proteins annotated                          16.56   37.11        21.19          25.13     100
Percent of annotating articles                         54.09   42.32        3.51           0.14      100

Number of proteins and annotating articles assigned to each article annotation cohort. Columns: 1: articles annotating a single protein (singletons); 1 < n < 10: articles annotating more than 1 and less than 10 proteins (low throughput); 10 < n < 100: medium throughput; n ≥ 100: articles annotating 100 proteins and more (high throughput). As can be seen, high-throughput articles comprise 0.14% of the total articles used for experimental annotations, but annotate 25.13% of the proteins in UniProt-GOA.
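Assigning articles to the cohorts of Table 1 can be sketched as below. This is an illustration only: the function names and the toy input are hypothetical, and the handling of the boundary values is an assumption, since the sketch places n = 10 in the low-throughput cohort and n = 100 in the high-throughput cohort.

```python
from collections import Counter


def cohort(n_proteins):
    """Map the number of proteins an article annotates to a cohort label.

    Boundary handling is an assumption of this sketch: n == 10 counts as
    low throughput and n == 100 as high throughput.
    """
    if n_proteins < 1:
        raise ValueError("an annotating article must annotate at least one protein")
    if n_proteins == 1:
        return "singleton"
    if n_proteins <= 10:
        return "low"
    if n_proteins < 100:
        return "medium"
    return "high"


def cohort_sizes(proteins_by_article):
    """Count articles per cohort from a {article_id: set_of_protein_ids} mapping."""
    return Counter(cohort(len(proteins)) for proteins in proteins_by_article.values())


# Hypothetical articles with made-up identifiers.
toy = {
    "a1": {"P1"},
    "a2": {"P1", "P2", "P3"},
    "a3": {f"Q{i}" for i in range(50)},
    "a4": {f"R{i}" for i in range(100)},
}
sizes = cohort_sizes(toy)
```

Dividing each cohort count by the total number of articles gives the "Percent of annotating articles" row.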
Table 2. Sequence Redundancy in Top-50 Annotating Articles

Species           num. articles   num. prot   Clusters at 100%   % redundancy   Mean genes/cluster
C. elegans        12              8416        3338               60             3.74
A. thaliana       16              8879        4694               47             3.92
M. musculus       3               4220        2273               46             2.75
M. tuberculosis   2               2351        1702               28             2.22
S. cerevisiae     5               3542        2550               28             2.33
H. sapiens        4               5593        4509               19             2.36
D. melanogaster   3               1217        1003               18             2.17
S. pombe          2               4502        4281               5              2.00

Species: annotated species; num. articles: number of annotating articles; num. prot: number of proteins annotated by top-50 articles for that species; Clusters at 100%: number of clusters of 100% identical proteins; % redundancy: 100 × (1 − column 4 / column 3), which reproduces the tabulated values; this is the percentage of proteins annotated more than once for a given species in the top 50 articles; Mean genes/cluster: the mean number of genes per cluster, for clusters having more than a single gene.
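The "% redundancy" column of Table 2 appears to equal 100 × (1 − clusters / proteins). The sketch below checks this reading against the tabulated rows; the function name is hypothetical and the formula is inferred from the numbers rather than taken from the analysis scripts.

```python
def percent_redundancy(num_proteins, num_clusters):
    """Percentage of proteins that fall into an already-counted 100%-identity cluster."""
    return 100.0 * (1.0 - num_clusters / num_proteins)


# (species, num. prot, clusters at 100%, printed % redundancy) copied from Table 2.
TABLE_2 = [
    ("C. elegans", 8416, 3338, 60),
    ("A. thaliana", 8879, 4694, 47),
    ("M. musculus", 4220, 2273, 46),
    ("M. tuberculosis", 2351, 1702, 28),
    ("S. cerevisiae", 3542, 2550, 28),
    ("H. sapiens", 5593, 4509, 19),
    ("D. melanogaster", 1217, 1003, 18),
    ("S. pombe", 4502, 4281, 5),
]

# Recompute the column and round to whole percent, as printed in the table.
computed = {
    species: round(percent_redundancy(prot, clusters))
    for species, prot, clusters, _printed in TABLE_2
}
```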
Table 3. Annotation Consistency in Top 50 articles

Species           Ont.   num prot   mean k_{P,O}   stdv    stderr   num articles   num terms
A. thaliana       CCO    1941       0.251          0.328   0.007    15             18
C. elegans        BPO    1847       0.388          0.239   0.006    12             41
D. melanogaster   BPO    76         0.086          0.22    0.025    3              8
D. melanogaster   CCO    81         0.068          0.234   0.026    3              5
H. sapiens        CCO    167        0.285          0.365   0.028    2              20
M. musculus       CCO    807        0.832          0.291   0.01     3              2
S. cerevisiae     CCO    744        0.759          0.379   0.014    4              15
M. tuberculosis   CCO    532        0.309          0.41    0.018    2              3

Species: annotated species; Ont.: annotating GO ontology; num prot: number of annotated proteins in that species and ontology that are annotated by more than one paper; mean k_{P,O}, stdv, stderr: mean annotation consistency for a protein in that species and ontology, with its standard deviation and standard error; num articles: number of annotating articles; num terms: number of annotating terms. Annotations by less than two articles or two terms (or both) for the same protein/ontology combination have been omitted.
Table 4. Fraction of Proteins Exclusively Annotated by High Throughput Studies

Taxon ID   Taxon                        XHT    Total Proteins   %XHT
284812     Schizosaccharomyces pombe    2781   4507             61.704
1773       Mycobacterium tuberculosis   1224   2317             52.8269
6239       Caenorhabditis elegans       2493   5302             47.02
9606       Homo sapiens                 4016   11521            34.8581
44689      Dictyostelium discoideum     425    1256             33.8376
3702       Arabidopsis thaliana         3199   10153            31.5079
237561     Candida albicans SC5314      327    1243             26.3073
10090      Mus musculus                 2567   22068            11.6322
7227       Drosophila melanogaster      735    7501             9.7987
559292     Saccharomyces cerevisiae     439    5086             8.6315
83333      Escherichia coli K-12        83     1606             5.1681
7955       Danio rerio                  117    4633             2.5254
10116      Rattus norvegicus            11     4634             0.2374

Taxon ID: NCBI Taxon ID number; Taxon: annotated species; XHT: number of proteins exclusively annotated by high-throughput experimental studies (100 or more proteins annotated per study); Total Proteins: total number of experimentally annotated proteins in that species; %XHT: percentage of proteins in that species that are annotated exclusively by HT studies.
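The XHT count of Table 4 can be illustrated with a minimal sketch. The function name, both input mappings and the toy data are hypothetical; only the threshold of 100 proteins per study is taken from the table legend.

```python
def exclusively_high_throughput(articles_by_protein, proteins_per_article, threshold=100):
    """Return proteins whose every annotating article annotates >= threshold proteins."""
    return {
        protein
        for protein, articles in articles_by_protein.items()
        if articles and all(proteins_per_article[a] >= threshold for a in articles)
    }


# Hypothetical data: article "a1" is high throughput, article "a2" is not.
articles_by_protein = {"P1": {"a1"}, "P2": {"a1", "a2"}, "P3": {"a2"}}
proteins_per_article = {"a1": 150, "a2": 3}

xht = exclusively_high_throughput(articles_by_protein, proteins_per_article)
percent_xht = 100.0 * len(xht) / len(articles_by_protein)
```

A protein that also appears in any lower-throughput article, such as P2 here, is not counted as exclusively high throughput.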
Supplementary Material Legends

Table S1: The top 50 annotating articles. N: article rank; Proteins: number of proteins annotated in this article; Annotations: number of annotating GO terms; Species: annotated species; ref.: annotating article; MFO/BPO/CCO: number of proteins annotated in the Molecular Function, Biological Process and Cellular Component ontologies, respectively.

Table S2: The Top-50 studies and the ECO terms we have assigned to them. PMID: articles' PubMed ID; ECO terms/ECO IDs: terms and IDs we assigned to the articles.

Table S3: ECO terms were assigned by us to the top-50 annotating papers. The table entries are ranked by the frequency of the assignments, i.e. 27 papers are assigned with term ECO:0000160, 21 were assigned ECO:0000004, etc. Entries in boldface are for computational methods, which were used in many papers in combination with experimental methods to assign function. Table S2 lists the ECO terms.
Table S1. Top 50 Annotating Articles

N    Proteins   Annotations   Species           ref.   MFO   BPO    CCO
1    4937       11050         H. sapiens        [?]    -     -      11050
2    4247       7046          S. pombe          [26]   -     -      7046
3    2412       2412          H. sapiens        [?]    -     -      2412
4    1791       5918          C. elegans        [28]   -     5918   -
5    1406       1863          S. cerevisiae     [29]   -     -      1863
6    1251       1251          A. thaliana       [?]    -     -      1251
7    1205       1476          C. elegans        [?]    -     1476   -
8    1186       1213          M. musculus       [32]   -     -      1213
9    1136       1136          A. thaliana       [33]   -     -      1136
10   1101       2269          C. elegans        [?]    -     2269   -
11   1043       1365          M. tuberculosis   [?]    -     -      1365
12   1041       1041          A. thaliana       [?]    -     -      1041
13   865        1533          C. elegans        [36]   -     1533   -
14   845        845           S. cerevisiae     [?]    -     -      845
15   784        784           A. thaliana       [38]   -     -      784
16   735        735           M. tuberculosis   [39]   -     -      735
17   724        882           A. thaliana       [?]    -     -      882
18   634        634           A. thaliana       [?]    -     -      634
19   613        613           Mycobacter sp.    [?]    -     613    -
20   607        661           C. elegans        [?]    -     659    2
21   577        577           A. thaliana       [?]    -     -      577

Continued on next page 
The top 50 annotating articles. N: article rank; Proteins: number of proteins annotated in this article; Annotations: number of annotating GO terms; Species: annotated species; ref.: annotating article; MFO/BPO/CCO: number of proteins annotated in the Molecular Function, Biological Process and Cellular Component ontologies, respectively.
Table S2. ECO Terms Assigned to Top-50 Papers

PMID       Ref    ECO terms/ECO IDs
18029348   [?]    imaging assay evidence/ECO:0000324; immunofluorescence evidence/ECO:0000007; immunolocalization evidence/ECO:0000087
16823372   [26]   imaging assay evidence/ECO:0000324; yellow fluorescent protein fusion protein localization evidence/ECO:0000128; enzyme inhibition experiment evidence/ECO:0000184
18614015   [?]    imaging assay evidence/ECO:0000324; protein separation followed by fragment identification evidence/ECO:0000160; sequence similarity evidence/ECO:0000044; cell fractionation evidence/ECO:0000004; GFP fusion protein localization evidence/ECO:0000126; computational combinatorial evidence/ECO:0000053; motif similarity evidence/ECO:0000028; targeting sequence prediction evidence/ECO:0000081; protein BLAST evidence/ECO:0000208
14551910   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; nucleotide BLAST evidence/ECO:0000207; sequence alignment evidence/ECO:0000200
14562095   [29]   imaging assay evidence/ECO:0000324; GFP fusion protein localization evidence/ECO:0000126; fusion protein localization evidence/ECO:0000124; affinity chromatography evidence/ECO:0000079
18431481   [30]   protein separation followed by fragment identification evidence/ECO:0000160; targeting sequence prediction evidence/ECO:0000081; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; imported information/ECO:0000311
15791247   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; protein BLAST evidence/ECO:0000208
14651853   [32]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; targeting sequence prediction evidence/ECO:0000081; sequence similarity evidence/ECO:0000044; protein BLAST evidence/ECO:0000208; nucleotide BLAST evidence/ECO:0000207; Affymetrix array experiment evidence/ECO:0000101; imported information/ECO:0000311
17317660   [33]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; transmembrane domain prediction evidence/ECO:0000083; sequence similarity evidence/ECO:0000044
12529635   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; motif similarity evidence/ECO:0000028; protein BLAST evidence/ECO:0000208; nucleotide BLAST evidence/ECO:0000207; computational combinatorial evidence/ECO:0000053
15525680   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; transmembrane domain prediction evidence/ECO:0000083; sequence similarity evidence/ECO:0000044; computational combinatorial evidence/ECO:0000053; biological system reconstruction/ECO:0000088; imported information/ECO:0000311; protein BLAST evidence/ECO:0000208
21166475   [35]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; computational combinatorial evidence/ECO:0000053; imported information/ECO:0000311; transmembrane domain prediction evidence/ECO:0000083; sequence alignment evidence/ECO:0000200; motif similarity evidence/ECO:0000028
15489339   [36]   imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; nucleotide BLAST evidence/ECO:0000207
16823961   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; imported information/ECO:0000311
21533090   [38]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; imported information/ECO:0000311; computational combinatorial evidence/ECO:0000053; transmembrane domain prediction evidence/ECO:0000083; sequence alignment evidence/ECO:0000200; motif similarity evidence/ECO:0000028; targeting sequence prediction evidence/ECO:0000081
14532352   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083
20061580   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083; imported information/ECO:0000311; targeting sequence prediction evidence/ECO:0000081; protein expression level evidence/ECO:0000046
15028209   [41]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; targeting sequence prediction evidence/ECO:0000081; Affymetrix array experiment evidence/ECO:0000101; protein expression level evidence/ECO:0000046; protein BLAST evidence/ECO:0000208; computational combinatorial evidence/ECO:0000053; motif similarity evidence/ECO:0000028; transmembrane domain prediction evidence/ECO:0000083
12657046   [?]    mutant phenotype evidence/ECO:0000015; nucleic acid hybridization evidence/ECO:0000026; imported information/ECO:0000311; sequence similarity evidence/ECO:0000044; combinatorial evidence/ECO:0000212
17704769   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016
17432890   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083; imported information/ECO:0000311; targeting sequence prediction evidence/ECO:0000081; protein BLAST evidence/ECO:0000208; computational combinatorial evidence/ECO:0000053
11231151   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016
17417969   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016
14576278   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083
16429126   [?]    protein separation followed by fragment identification evidence/ECO:0000160; sequence similarity evidence/ECO:0000044; affinity chromatography evidence/ECO:0000079; protein BLAST evidence/ECO:0000208; imported information/ECO:0000311
21529718   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; computational combinatorial evidence/ECO:0000053
11/56614   [?]    GFP fusion protein localization evidence/ECO:0000126; yellow fluorescent protein fusion protein localization evidence/ECO:0000128; imaging assay evidence/ECO:0000324; motif similarity evidence/ECO:0000028; protein BLAST evidence/ECO:0000208; nucleotide BLAST evidence/ECO:0000207
17644812   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083; targeting sequence prediction evidence/ECO:0000081; computational combinatorial evidence/ECO:0000053
16618929   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083
18433294   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; immunofluorescence evidence/ECO:0000007
[?]        [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; imported information/ECO:0000311; transmembrane domain prediction evidence/ECO:0000083
14671022   [52]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; protein BLAST evidence/ECO:0000208; targeting sequence prediction evidence/ECO:0000081
12529643   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016
12445391   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; BLAST evidence/ECO:0000206
15539469   [53]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; targeting sequence prediction evidence/ECO:0000081; transmembrane domain prediction evidence/ECO:0000083; motif similarity evidence/ECO:0000028; protein BLAST evidence/ECO:0000208; computational combinatorial evidence/ECO:0000053
12865426   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083
16189514   [55]   yeast 2-hybrid evidence/ECO:0000068; imaging assay evidence/ECO:0000324; motif similarity evidence/ECO:0000028; co-purification evidence/ECO:0000022; combinatorial evidence/ECO:0000212
20422638   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; combinatorial evidence/ECO:0000212
12938931   [?]    protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; sequence similarity evidence/ECO:0000044; nucleotide BLAST evidence/ECO:0000207; imported information/ECO:0000311; transmembrane domain prediction evidence/ECO:0000083
16336044   [58]   imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016
18633119   [59]   protein separation followed by fragment identification evidence/ECO:0000160; cell fractionation evidence/ECO:0000004; Western blot evidence/ECO:0000112
11914276   [?]    imaging assay evidence/ECO:0000324; immunofluorescence evidence/ECO:0000007; epitope-tagged protein immunolocalization evidence/ECO:0000092; transmembrane domain prediction evidence/ECO:0000083; imported information/ECO:0000311
11099033   [10]   imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; protein BLAST evidence/ECO:0000208; computational combinatorial evidence/ECO:0000053
11099034   [?]    imaging assay evidence/ECO:0000324; RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; nucleotide BLAST evidence/ECO:0000207; protein BLAST evidence/ECO:0000208
11591653   [62]   hybrid interaction evidence/ECO:0000025; imaging assay evidence/ECO:0000324
16502469   [63]   protein separation followed by fragment identification evidence/ECO:0000160; sequence similarity evidence/ECO:0000044; protein BLAST evidence/ECO:0000208; Northern assay evidence/ECO:0000106; reverse transcription polymerase chain reaction transcription evidence/ECO:0000108
[?]        [64]   microarray RNA expression level evidence/ECO:0000104; sequence orthology evidence used in manual assertion/ECO:0000266; motif similarity evidence/ECO:0000028
17412918   [?]    RNAi evidence/ECO:0000019; loss-of-function mutant phenotype evidence/ECO:0000016; imaging assay evidence/ECO:0000324
18981222   [?]    protein separation followed by fragment identification evidence/ECO:0000160; sequence similarity evidence/ECO:0000044; protein BLAST evidence/ECO:0000208; in vitro assay evidence/ECO:0000181; affinity chromatography evidence/ECO:0000079; imaging assay evidence/ECO:0000324; mutant phenotype evidence/ECO:0000015
16287169   [?]    protein separation followed by fragment identification evidence/ECO:0000160; sequence similarity evidence/ECO:0000044; transmembrane domain prediction evidence/ECO:0000083; sequence alignment evidence/ECO:0000200; computational combinatorial evidence/ECO:0000053; motif similarity evidence/ECO:0000028; targeting sequence prediction evidence/ECO:0000081



The Top-50 studies and the ECO terms we have assigned to them. PMID: articles' PubMed ID; ECO terms/ECO IDs: terms and IDs we assigned to the articles.



Table S3. Count of ECO terms in top-50 papers

N    ECO term                                                                    ECO ID        Articles
1    protein separation followed by fragment identification evidence            ECO:0000160   27
2    sequence similarity evidence                                                ECO:0000044   27
3    imaging assay evidence                                                      ECO:0000324   24
4    cell fractionation evidence                                                 ECO:0000004   23
5    transmembrane domain prediction evidence                                    ECO:0000083   17
6    loss-of-function mutant phenotype evidence                                  ECO:0000016   15
7    protein BLAST evidence                                                      ECO:0000208   15
8    RNAi evidence                                                               ECO:0000019   15
9    imported information                                                        ECO:0000311   13
10   computational combinatorial evidence                                        ECO:0000053   11
11   targeting sequence prediction evidence                                      ECO:0000081   11
12   motif similarity evidence                                                   ECO:0000028   10
13   nucleotide BLAST evidence                                                   ECO:0000207   7
14   sequence alignment evidence                                                 ECO:0000200   4
15   GFP fusion protein localization evidence                                    ECO:0000126   3
16   immunofluorescence evidence                                                 ECO:0000007   3
17   affinity chromatography evidence                                            ECO:0000079   3
18   computational combinatorial evidence                                        ECO:0000053   2
19   Affymetrix array experiment evidence                                        ECO:0000101   2
20   protein expression level evidence                                           ECO:0000046   2
21   mutant phenotype evidence                                                   ECO:0000015   2
22   combinatorial evidence                                                      ECO:0000212   2
23   co-purification evidence                                                    ECO:0000022   1
24   epitope-tagged protein immunolocalization evidence                          ECO:0000092   1
25   sequence orthology evidence used in manual assertion                        ECO:0000266   1
26   YFP fusion protein localization evidence                                    ECO:0000128   2
27   in vitro assay evidence                                                     ECO:0000181   1
28   biological system reconstruction                                            ECO:0000088   1
29   reverse transcription polymerase chain reaction transcription evidence     ECO:0000108   1
30   Northern assay evidence                                                     ECO:0000106   1
31   Western blot evidence                                                       ECO:0000112   1
32   microarray RNA expression level evidence                                    ECO:0000104   1
33   fusion protein localization evidence                                        ECO:0000124   1
34   BLAST evidence                                                              ECO:0000206   1
35   nucleic acid hybridization evidence                                         ECO:0000026   1
36   enzyme inhibition experiment evidence                                       ECO:0000184   1
37   immunolocalization evidence                                                 ECO:0000087   1
38   hybrid interaction evidence                                                 ECO:0000025   1
39   yeast 2-hybrid evidence                                                     ECO:0000068   1



ECO terms were assigned by us to the top-50 annotating papers. The table entries are ranked by the frequency of the assignments, i.e. 27 papers are assigned with term ECO:0000160, 21 were assigned ECO:0000004, etc. Entries in boldface are for computational methods, which were used in many papers in combination with experimental methods to assign function. Table S2 lists the ECO terms.



